arXiv:1507.07598vl [math.OC] 27 Jul 2015 


The Proximal Distance Algorithm 
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Abstract. The MM principle is a device for creating optimization algorithms satis¬ 
fying the ascent or descent property. The current survey emphasizes the role of the 
MM principle in nonlinear programming. For smooth functions, one can construct an 
adaptive interior point method based on scaled Bregman barriers. This algorithm does 
not follow the central path. For convex programming subject to nonsmooth constraints, 
one can combine an exact penalty method with distance majorization to create versa¬ 
tile algorithms that are effective even in discrete optimization. These proximal distance 
algorithms are highly modular and reduce to set projections and proximal mappings, 
both very well-understood techniques in optimization. We illustrate the possibilities in 
linear programming, binary piecewise-linear programming, nonnegative quadratic pro¬ 
gramming, Iq regression, matrix completion, and inverse sparse covariance estimation. 
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1. Introduction 

The MM principle is a device for constructing optimization algorithms [H [13 HSJ 
113 ISO]- In essence, it replaces the objective function f{x) by a simpler surrogate 
function g[x \ a;„) anchored at the current iterate and majorizing or minorizing 
f{x). As a byproduct of optimizing g{x \ Xn) with respect to x, the objective 
function f{x) is sent downhill or uphill, depending on whether the purpose is 
minimization or maximization. The next iterate a;„+i is chosen to optimize the 
surrogate g{x \ Xn) subject to any relevant constraints. Majorization combines 
two conditions: the tangency condition g{xn \ Xn) = fixn) and the domination 
condition g{x \ a;„) > f{x) for all x. In minimization these conditions and the 
definition of Xn+i lead to the descent property 

f{Xn+i) < g{Xn+i I Xn) < g{Xn \ Xn) = /(*„). 

Minorization reverses the domination inequality and produces an ascent algorithm. 
Under appropriate regularity conditions, an MM algorithm is guaranteed to con¬ 
verge to a stationary point of the objective function [3D]- From the perspective of 
dynamical systems, the objective function serves as a Liapunov function for the 
algorithm map. 

The MM principle simplifies optimization by: (a) separating the variables of 
a problem, (b) avoiding large matrix inversions, (c) linearizing a problem, (d) 


2 


Kenneth Lange and Kevin L. Keys 


restoring symmetry, (e) dealing with equality and inequality constraints gracefully, 
and (f) turning a nondifferentiable problem into a smooth problem. Choosing a 
tractable surrogate function g{x \ x^) that hugs the objective function f{x) as 
tightly as possible requires experience and skill with inequalities. The majorization 
relation between functions is closed under the formation of sums, nonnegative 
products, limits, and composition with an increasing function. Hence, it is possible 
to work piecemeal in majorizing complicated objective functions. 

It is impossible to do justice to the complex history of the MM principle in a 
paragraph. The celebrated EM (expectation-maximization) principle of computa¬ 
tional statistics is a special case of the MM principle [33]. Specific MM and EM 
algorithms appeared years before the principle was well understood [22l 132] jSS] 
|40l ST] . The widely applied projected gradient and proximal gradient algorithms 
can be motivated from the MM perspective, but the early emphasis on operators 
and fixed points obscured this distinction. Although Dempster, Laird, and Rubin 
|15j formally named the EM algorithm, many of their contributions were antici¬ 
pated by Baum [1] and Sundberg [39]. The MM principle was clearly stated by 
Ortega and Rheinboldt [36]. de Leeuw m is generally credited with recognizing 
the importance of the principle in practice. The EM algorithm had an immediate 
and large impact in computational statistics. The more general MM principle was 
much slower to take hold. The papers [a HU [26] by the Dutch school of psycho¬ 
metricians solidified its position. (In this early literature the MM principle is called 
iterative majorization.) The related Dinklebach m maneuver in fractional linear 
programming also highlighted the importance of the descent property in algorithm 
construction. 

Before moving on, let us record some notational conventions. All vectors and 
matrices appear in boldface. The * superscript indicates a vector or matrix trans¬ 
pose. The Euclidean norm of a vector x is denoted by ||a:|| and the Frobenius 
norm of a matrix M by ||7Vf||i?. For a smooth real-valued function f{x), we write 
its gradient (column vector of partial derivatives) as Vf{x), its first differential 
(row vector of derivatives) as df{x) = \7f{x)*, and its second differential (Hessian 
matrix) as cPf{x). 


2. An Adaptive Barrier Method 

In convex programming it simplifies matters notationally to replace a convex in¬ 
equality constraint hj{x) < 0 by the concave constraint Vj(x) = —hj{x) > 0. 
Barrier methods operate on the relative interior of the feasible region where all 
Vj{x) > 0. Adding an appropriate barrier term to the objective function f{x) 
keeps an initially inactive constraint Vj(x) inactive throughout an optimization 
search. If the barrier function is well designed, it should adapt and permit conver¬ 
gence to a feasible point y with one or more inequality constraints active. 

We now briefly summarize an adaptive barrier method that does not follow 
the central path [27]. Because the logarithm of a concave function is concave, the 
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Bregman majorization [7] 


- In Vj (x) + In Vj (Xn) + 



dVj{Xn){x - £C„) > 0 


acts as a convex barrier for a smooth constraint Vj(x) > 0. To make the bar¬ 
rier adaptive, we scale it by the current value Vj{xn) of the constraint. These 
considerations suggest an MM algorithm based on the surrogate function 


g{x I Xn) = f{x) - p'^Vj{Xn)lnVj{x) + p'^dVj{Xn){x - Xn) 
i=i t=i 


for s inequality constraints. Minimizing the surrogate subject to relevant linear 
equality constraints Ax = b produces the next iterate Xn+i- The constant p 
determines the tradeoff between keeping the constraints inactive and minimizing 
f{x). One can show that the MM algorithm with exact minimization converges to 
the constrained minimum of f{x) |30j . 

In practice one step of Newton’s method is usually adequate to decrease f{x). 
The hrst step of Newton’s method minimizes the second-order Taylor expansion of 
g{x I Xn) around Xn subject to the equality constraints. Given smooth functions, 
the two differentials 


dg{Xn I Xn) = df{Xn) 


d^g{Xn I Xn) = d?f{Xn) - p'^d^V^iXn) 

t=l 




i{Xn) 


VVj{Xn)dVj{Xn) 


( 1 ) 


are the core ingredients in the quadratic approximation of g{x \ a;„). Unfortu¬ 
nately, one step of Newton’s method is neither guaranteed to decrease f[x) nor to 
respect the nonnegativity constraints. 

Example 1. Adaptive Barrier Method for Linear Programming 

For instance, the standard form of linear programming requires minimizing 
a linear function f{x) = c*x subject to Ax = b and a; > 0. The quadratic 
approximation to the surrogate g{x \ Xn) amounts to 


p 

C Xn C (a? Xn) T “ ^ ^ (^Xj Xnj) ■ 
3 = 1 ■ 


'"nj 


The minimum of this quadratic subject to the linear equality constraints occurs at 
the point 


Xn+l — Xn — Dn C -|- Dn A (ADn -^ ) ^ ~ AXn + ADn c). 
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Here is the diagonal matrix with *th diagonal entry px^l, and the increment 
Xn+i — Xn satisfies the linear equality constraint A[xn+i — Xn) = h — Axn- □ 

One can overcome the objections to Newton updates by taking a controlled 
step along the Newton direction = Xn+i — Xn- The key is to exploit the theory 
of self-concordant functions 0135]. A thrice differentiable convex function h{t) is 
said to be self-concordant if it satisfies the inequality 


for some constant c > 0 and all t in the essential domain of h{t). All con¬ 
vex quadratic functions qualify as self-concordant with c = 0. The function 
hit) = — ln(a< -I- b) is self-concordant with constant 1. The class of self-concordant 
functions is closed under sums and composition with linear functions. A con¬ 
vex function k{x) with domain Rp is said to be self-concordant if every slice 
hit) = k{x + tu) is self-concordant. 

Rather than conduct an expensive one-dimensional search along the Newton 
direction Xn+tu^ one can majorize the surrogate function h{t) = g(xn+tu„ \ x„) 
along the half-line t > 0. The clever majorization 

h{t) < h(0) + h'{0)t - -h”{0)^/H - 4 ln[l - cth"( 0 ) 1 / 2 ] (2) 

c 

serves the dual purpose of guaranteeing a decrease in f{x) and preventing a vi¬ 
olation of the inequality constraints [35]. Here c is the self-concordance constant 
associated with the surrogate. The optimal choice of t reduces to the damped 
Newton update 


^ h'jO) 

/i"(0)-c/i'(0)h"(0)i/2' 
The first two derivatives of h{t) are clearly 


(3) 


h'{0) = df{Xn)Un 

S 

h"{0) = U*^Cpf{Xn)Un - p'^ U^d'^Vj {Xn)Un 

J = 1 

1 

+ , [dVj{Xn)Un]^. 


The first of these derivatives is nonpositive because Un is a descent direction for 
fix). The second is generally positive because all of the contributing terms are 
nonnegative. 

When f{x) is quadratic and the inequality constraints are affine, detailed calcu¬ 
lations show that the surrogate function g{x \ Xn) is self-concordant with constant 

1 

c = =. 

A/pmin{ui(a;„),... ,Us(a:n)} 
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Iteration n 

No Safeguard 

Self-concordant Safeguard 

C*Xn 

l|A„|| 

C*Xn 

l|A„|| 


1 

-1.20000 

0.25820 

-1.11270 

0.14550 

0.56351 

2 

-1.33333 

0.17213 

-1.20437 

0.11835 

0.55578 

3 

-1.41176 

0.10125 

-1.27682 

0.09353 

0.55026 

4 

-1.45455 

0.05523 

-1.33288 

0.07238 

0.54630 

5 

-1.47692 

0.02889 

-1.37561 

0.05517 

0.54345 

10 

-1.49927 

0.00094 

-1.47289 

0.01264 

0.53746 

15 

-1.49998 

0.00003 

-1.49426 

0.00271 

0.53622 

20 

-1.50000 

0.00000 

-1.49879 

0.00057 

0.53597 

25 

-1.50000 

0.00000 

-1.49975 

0.00012 

0.53591 

30 

-1.50000 

0.00000 

-1.49995 

0.00003 

0.53590 

35 

-1.50000 

0.00000 

-1.49999 

0.00001 

0.53590 

40 

-1.50000 

0.00000 

-1.50000 

0.00000 

0.53590 


Table 1. Performance of the adaptive barrier method in linear programming. 


Taking the damped Newton’s step with step length dS]) keeps Xn + tnUn in the 
relative interior of the feasible region while decreasing the surrogate and hence 
the objective function f{x). When f{x) is not quadratic but can be majorized 
by a quadratic q{x \ a;„), one can replace f{x) by q{x \ Xn) in calculating the 
adaptive-barrier update. The next iterate Xn+i retains the descent property. 

As a toy example consider the linear programming problem of minimizing c*x 
subject to Ax = b and x > 0. Applying the adaptive barrier method to the 
choices 


2 

0 

0 

1 

0 

o\ 



o 

II 

2 

0 

0 

1 

0 , 

b= 1 , 

c 

\0 

0 

2 

0 

0 

1/ 

\l 



" V \V 0 

voy 

and to the feasible initial point ccq = ^1 produces the results displayed in Table 
[TJ Not shown is the minimum point (i, i, i, 0, 0,0)*. Columns two and three 
of the table record the progress of the unadorned adaptive barrier method. The 
quantity || A„|| equals the Euclidean norm of the difference vector A^, = a;„ — 
Columns four and five repeat this information for the algorithm modified by the 
self-concordant majorization The quantity in column six represents the 
optimal step length ([3|) in going from Xn-i to Xn along the Newton direction 
Un-i- Clearly, there is a price to be paid in implementing a safeguarded Newton 
step. In practice, this price is well worth paying. 
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3. Distance Majorization 


On a Euclidean space, the distance to a closed set S' is a Lipschitz function 
dist(a;,S) with Lipschitz constant 1. If S is also convex, then dist(a:,S) is a 
convex function. Projection onto S is intimately tied to dist(a:,S). Unless S is 
convex, the projection operator Ps{x) is multi-valued for at least one argument 
X. Fortunately, it is possible to majorize dist(a;,S) at a;„ by \\x — Ps(a;„)||. This 
simple observation is the key to the proximal distance algorithm to be discussed 
later. In the meantime, let us show how to derive two feasibility algorithms by 
distance majorization [9]. Let Si,., Sm be closed sets. The method of averaged 
projections attempts to find a point in their intersection S = To derive 

the algorithm, consider the convex combination 

m 

f{x) — ^ aj dist(a;, Sj)^ 
i=i 

of squared distance functions. Obviously, f{x) vanishes precisely on S when all 
aj > 0. The majorization 

m 

g{x I Xn) = ^aj\\x- Ps^{Xn)f 
of f{x) is easily minimized. The minimum point of g{x \ a;„), 

m 

Xn+l — 'y ' OLjPg^ (^Xn), 
i=i 

defines the averaged operator. The MM principle guarantees that £c„+i decreases 
the objective function. 

Von Neumann’s method of alternating projections can also be derived from 
this perspective. For two sets Si and S 2 , consider the problem of minimizing the 
objective function f{x) = dist(a;, S' 2 )^ subject to the constraint x G Si. The 
function 


g{x I Xn) = ||a: - Ps 2 {xnW 


majorizes f{x). Indeed, the domination condition g{x \ x^) > f{x) holds because 
PSii'^n) belongs to S' 2 ; the tangency condition g[xn \ Xn) = f{xn) holds because 
Ps^i^n) is the closest point in S '2 to Xn. The surrogate function g{x \ a;„) is 
minimized subject to the constraint by taking Xn+i = Psi o Ps^i^n)- The MM 
principle again ensures that Xn+i decreases the objective function. When the two 
sets intersect, the least distance of 0 is achieved at any point in the intersection. 
One can extend this derivation to three sets by minimizing the objective function 
f{x) = dist(a;, S 2 )^ -I- dist(a;, Sa)^ subject to a; G Si. The surrogate 


g{x I a;„) = ||a; - -h ||a; - Ps^ixr, 


= 2 




+ Cn 
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relies on an irrelevant constant c„. The closest point in Si is 


Xn+l = PSi 



This construction clearly generalizes to more than three sets. 


4. The Proximal Distance Method 

We now turn to an exact penalty method that applies to nonsmooth functions. 
Clarke’s exact penalty method m turns the constrained problem of minimizing 
a function f{y) over a closed set S into the unconstrained problem of minimizing 
the function f{y) +pdist(y, S) for p sufficiently large. Here is a precise statement 
of a generalization of Clarke’s result [siiiniin]. 

Proposition 1. Suppose f{y) achieves a local minimum on S at the point x. 
Let 4>s{y) denote a function that vanishes on S and satisfies 4>siy) > cdist(y, S') 
for all X and some positive constant c. If f{y) is locally Lipschitz around x with 
constant L, then for every p > c~^L, Fp{y) = f{y) + p(psiy) achieves a local 
unconstrained minimum at x. 

Classically the choice 4>s{x) = dist(a;,S) was preferred. For affine equality 
constraints gi{x) = 0 and affine inequality constraints hj{x) < 0, Hoffman’s bound 



applies, where t is some positive constant, S is the feasible set where G{y) = 0, 
and H{y)+ < 0 [53]. The vector H{y)+ has components hj{x)+ = max{hj{y),0}. 
When S is the intersection of several closed sets Si,..., Sm, tbe alternative 


m 



(4) 


is attractive. The next proposition gives sufficient conditions under which the 
crucial bound (j)s{y) > cdist(y, S) is valid for the function (|4|). 

Proposition 2. Suppose Si,..., Sm are closed convex sets in Rp with the first j 
sets polyhedral. Assume further that the intersection 


5 = (nLi5.)n(n-,-+iri s.) 


is nonempty and bounded. Then there exists a constant r > 0 such that 


m 


m 


dist(a:, S) < r HistYfr < T^frn I N dist(a;, Sf)^ 



for all X. The sets Si,..., Sm are said to be linearly regular. 
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Proof. See the references HHS] for all details. A polyhedral set is the nonempty 
intersection of a finite number of half-spaces. The operator ri K forms the relative 
interior of the convex set K, namely, the interior of K relative to the affine hull of 
K. When K is nonempty, its relative interior is nonempty and generates the same 
affine hull as K itself. □ 


In general, we will require f{x) and (fsix) to be continuous functions and the 
sum Fp{y) = f{y) + ptfsiy) to be coercive for some value p = po- It then follows 
that Fp{y) is coercive and attains its minimum for all p > po- One can prove 
a partial converse to Clarke’s theorem mm- This requires the enlarged set 
S'e = {cc : 4>s{x) < e} of points lying close to S as measured by (j)s{x). 

Proposition 3. Suppose that f{y) is Lipschitz on Se for some e > 0. Then 
under the stated assumptions on f{x) and 4>s{x), a global minimizer of Fp{y) is 
a constrained minimizer of f{y) for all suffieiently large p. 

When the constraint set S is compact and f{y) has a continuously varying local 
Lipschitz constant, the hypotheses of Proposition [3] are fulfilled. This is the case, 
for instance, when f{y) is continuously differentiable. With this background on the 
exact penalty method in mind, we now sketch an approximate MM algorithm for 
convex programming that is motivated by distance majorization. This algorithm is 
designed to exploit set projections and proximal maps. The proximal map piox^ly) 
associated with a convex function h{x) satisfies 


prox^(y) = argmin,^ 


+ ^\\y 



A huge literature and software base exist for computing projections and proximal 
maps [3]. 

Since the function dist(a;, S) is merely continuous, we advocate approximating 
it by the differentiable function 


diste(a;,S') = \/dist(a;, 5)^ -I- e 


for e > 0 small. The composite function diste(a;,5') is convex when S is convex 
because the function 'fF + e is increasing and convex on [0, oo). Instead of mini¬ 
mizing f(x) + pdist{x, S), we suggest minimizing the differentiable convex function 
f{x) pdist^{x, S) by an MM algorithm. Regardless of whether S is convex, the 
majorization 


diste(a;,S') < ^/\\x - Ps{xn)P -+- e (5) 

holds. If S is nonconvex, there may be a multiplicity of closest points, and one 
must choose a representative of the set Ps{xn). In any event one can invoke the 
univariate majorization 




t tn 
2-\/tra 


(6) 
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of the concave function y/i on the interval t > 0 and majorize the majorization ([5|) 
by 


y/\\x - Ps{XnW + e < 




1 

PsiXnW + e 


||a; 


Ps{XnW + Cn 


for some irrelevant constant c„. The second step of our proposed MM algorithm 
consists of minimizing the surrogate function 

on 

g{x I Xri) = f{x) + -^Wx - Ps{XnW 


V\\Xn - PsiXnW + e 


The corresponding proximal map drives f{x) + pdiste(a;, S') downhill. Under the 
more general exact penalty o, the surrogate function depends on a sum of spher¬ 
ical quadratics rather than a single spherical quadratic. 

It is possible to project onto a variety of closed nonconvex sets. For example, if 
S is the set of integers, then projection amounts to rounding. An ambiguous point 
n + ^ can be projected to either n or n + 1. Projection onto a finite set simply 
tests each point separately. Projection onto a Cartesian product is achieved via the 
Cartesian product of the projections. One can also project onto many continuous 
sets of interest. For example, to project onto the closed set of points having at most 
k nonzero coordinates, one zeros out all but the k largest coordinates in absolute 
value. Projection onto the sphere of center z and radius r takes y ^ z into the 
point z -I- \\yLz\\ (y ~ points of the sphere are equidistant from its center. 

By definition the update Xn+i = prox^-ij[Ps(a;„)] minimizes g(x | a;„). We 
will refer to this MM algorithm as the proximal distance algorithm. It enjoys 
several virtues. First, it allows one to exploit the extensive body of results on 
proximal maps and projections. Second, it does not demand that the constraint set 
S be convex. Third, it does not require the objective function f(x) to be convex 
or smooth. Finally, the minimum values and minimum points of the functions 
f(x) + pdist(a;, S) and f(x) -|- pdiste(a;, S) are close when e > 0 is small. 

In implementing the proximal distance algorithm, the constants L and e must 
specified. For many norms the Lipschitz constant L is known. For a differentiable 
function f(x), the mean value inequality suggests taking L equal to the maximal 
value of IIV/(a:) II in a neighborhood of the optimal point. In specific problems a 
priori bounds can be derived. If no such prior bound is known, then one has to 
guess an appropriate p and see if it leads to a constrained minimum. If not, p should 
be systematically increased until a constrained minimum is reached. Even with a 
justihable bound, it is prudent to start p well below its intended upper bound to 
emphasize minimization of the loss function in early iterations. Experience shows 
that gradually decreasing e is also a good tactic; otherwise, one again runs the 
risk of putting too much early stress on satisfying the constraints. In practice 
the sequences = minja^po) Pmax} and = max{/3“”eo, Cmin} work well for 
a and P slightly larger than 1, say 1.2, and po = eo = 1. On many problems 
more aggressive choices of a and /3 are possible. The values of Pmax and Cmin are 
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problem specific, but taking pmax substantially greater than a known Lipschitz 
constant slows convergence. Taking Cniin too large leads to a poor approximate 
solution. 


5. Sample Problems 


We now explore some typical applications of the proximal distance algorithm. In 
all cases we are able to establish local Lipschitz constants. Comparisons with 
standard optimization software serve as performance benchmarks. 

Example 2. Projection onto an Intersection of Closed Convex Sets 

Let Si,... ,Sk be closed convex sets, and assume that projection onto each Sj 
is straightforward. Dykstra’s algorithm [igiiH] is designed to find the projection 
of an external point y onto S = n^^-^^Sj. The proximal distance algorithm provides 
an alternative based on the convex function 

f{x) = ^/\\x-y\\'^+ S 

for 6 positive, say <5 = 1. The choice f{x) is preferable to the obvious choice 
\\x — yp because f{x) is Lipschitz with Lipschitz constant 1. In the proximal 
distance algorithm, we take 


(t>s{x) = 

and minimize the surrogate function 

k 


^dist(x,5'j)2 
\ 1=1 


g{x I Xr^) = f{x) + + ^\\X-Pn\\^ + Cn 

1=1 


where is the projection of a;„ onto Sj, p„ is the average of the projections p„ 
Cn is an irrelevant constant, and 


After rearrangement, the stationarity condition for optimality reads 
X = {1 - a)y + ap^, a= — 


kw„ 


^Wx-vW^+s 


kw„ 


In other words, x^+i is a convex combination of y and p„. 

To calculate the optimal coefficient a, we minimize the convex surrogate 

kw„ 


h{a) = g[{l — a)y + ap^ \ x^] = \/+ 6 H-^(1 “ a)^d^ + Cn 
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Iteration n 

Dykstra 

Proximal Distance 

^nl 

^n2 

^nl 

^n2 

0 

-1.00000 

2.00000 

-1.00000 

2.00000 

1 

-0.44721 

0.89443 

-0.44024 

1.60145 

2 

0.00000 

0.89443 

-0.25794 

1.38652 

3 

-0.26640 

0.96386 

-0.16711 

1.25271 

4 

0.00000 

0.96386 

-0.11345 

1.16647 

5 

-0.14175 

0.98990 

-0.07891 

1.11036 

10 

0.00000 

0.99934 

-0.01410 

1.01576 

15 

-0.00454 

0.99999 

-0.00250 

1.00257 

20 

0.00000 

1.00000 

-0.00044 

1.00044 

25 

-0.00014 

1.00000 

-0.00008 

1.00008 

30 

0.00000 

1.00000 

-0.00001 

1.00001 

35 

0.00000 

1.00000 

0.00000 

1.00000 


Table 2. Dykstra’s algorithm versus the proximal distance algorithm. 


for d = ||y — p„||. Its derivative 

h'{a) = = — kwn{^ — a)(f 

satisfies /i'(0) < 0 and h'{l) > 0 and possesses a unique root on the open interval 
(0,1). This root can be easily computed by bisection or Newton’s method. 

Table [2] compares Dykstra’s algorithm and the proximal distance algorithm on 
a simple planar example. Here is the closed unit ball in R^, and S 2 is the closed 
halfspace with a;i > 0. The intersection S reduces to the right half ball centered at 
the origin. The table records the iterates of the two algorithms from the starting 
point ajQ = (—1,2)* until their eventual convergence to the geometrically obvious 
solution (0,1)*. In the proximal distance method we set = 2 and aggressively 
e„ = 4“". The two algorithms exhibit similar performance but take rather different 
trajectories. □ 

Example 3. Binary Piecewise-Linear Functions 

The problem of minimizing the binary piecewise-linear function 

f{x) = ^ w^j\xi b*X 

i<j 


subject to X G {0,1}'^ and nonnegative weights Wij is a typical discrete optimiza¬ 
tion problem with applications in graph cuts. If we invoke the majorization 


Xi - 


Xi — 


\xi - Xj\ < 
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prior to applying the proximal operator, then the proximal distance algorithm 
separates the parameters. Parameter separation promotes parallelization and ben¬ 
efits from a fast algorithm for computing proximal maps in one dimension. The 
one-dimensional algorithm is similar to but faster than bisection m- Finally, the 
objective function is Lipschitz with the explicit constant 


i y j^i 

This assertion follows from the simple bound 

\fix) - f{y)\ - J/jl + \b*ix - y)\ 

i 

+ 11^11 ■ \\^-y\\ 

i y 

under the symmetry convention Wij = Wji. 


Dimension 

CPU times 

MM CVX 

Iterations 

2 

0.038 

0.080 

9 

4 

0.052 

0.060 

18 

8 

2.007 

0.050 

200 

16 

2.416 

0.100 

200 

32 

2.251 

0.130 

200 

64 

4.134 

0.400 

200 

128 

0.212 

2.980 

32 

256 

0.868 

62.63 

200 

512 

68.27 

1534 

200 

1024 

526.6 

* 

200 

2048 

127.2 

* 

200 

4096 

547.4 

* 

200 


Table 3. CPU times in seconds and MM iterations until convergence for binary piecewise 
linear functions. Asterisks denote computer runs exceeding computer memory limits. 
Iterations were capped at 200. 

Table [3] displays the numerical results for a few typical examples. For each 
dimension d we filled b with standard normal deviates and the upper triangle of 
the weight matrix W with the absolute values of such deviates. The lower triangle 
of W was determined by symmetry. Small values of b often lead to degenerate 
solutions X with all entries 0 or 1. To avoid this possibility, we multiplied each 
entry of b by d. In the graph cut context, a degenerate solution corresponds to 
no cuts at all or a completely cut graph. These examples depend on the schedules 
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Pn = min{1.2”,L} and e„ = max{1.2 ", 10 for the two tuning constants and 
the local Lipschitz constant 0. 

Although the MM proximal distance algorithm makes good progress towards 
the minimum in the first 100 iterations, it sometimes hovers around its limit with¬ 
out fully converging. This translates into fickle compute times, and for this reason 
we capped the number of MM iterations at 200. For small dimensions MM can 
be much slower than CVX. Fortunately, the performance of the MM algorithm 
improves markedly as d increases. In all runs the two algorithms reach the same 
solution after rounding components to the nearest integer. MM also requires much 
less storage than CVX. Asterisks appear in the table where CVX demanded more 
memory than our laptop computer could deliver. □ 

Example 4. Nonnegative Quadratic Programming 

The proximal distance algorithm is applicable in minimizing a convex quad¬ 
ratic f{x) = ^x*Ax + b*X subject to the constraint a; > 0. In this nonnegative 
quadratic programming program, let be the projection of the current iterate 
Xn onto S' = . If we define the weight 

^_ P _ 

Vl|a:n-y„|P + e’ 

then the next iterate can be expressed as 


Xn+1 = {A + Wnl) ^{uJuVn-b). 

The multiple matrix inversions implied by the update can be avoided by extracting 
and caching the spectral decomposition U*DU of A at the start of the algorithm. 
The inverse {A + Wnl)~^ then reduces to U*{D + WnI)~^U . The diagonal matrix 
D + Wnl is obviously trivial to invert. The remaining operations in computing 
Xn+i collapse to matrix times vector multiplications. Nonnegative least squares is 
a special case of nonnegative quadratic programming. 

One can estimate an approximate Lipschitz constant for this problem. Note 
that /(O) = 0 and that 


f{x) > - ||b|| ■ ||a:||, 

where Amin is the smallest eigenvalue of A. It follows that any point x with 
||a;|| > j^||6|| cannot minimize f{x) subject to the nonnegativity constraint. On 
the other hand, the gradient of f{x) satisfies 

||V/(a;)|| < ||A||||a;|| -f ||6|| < Amax||a:|| + ||b||- 

In view of the mean-value inequality, these bounds suggest that 

L = f ^ + l) ||6|| = [ 2 cond 2 (A) + 1] ||6|| 

\ ^min / 
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CPU times Optima 


d 

MM 

CV 

MA 

YA 

MM 

CV 

MA 

YA 

8 

0.97 

0.23 

0.01 

0.13 

-0.0172 

-0.0172 

-0.0172 

-0.0172 

16 

0.50 

0.24 

0.01 

0.11 

-1.1295 

-1.1295 

-1.1295 

-1.1295 

32 

0.50 

0.24 

0.01 

0.14 

-1.3811 

-1.3811 

-1.3811 

-1.3811 

64 

0.57 

0.28 

0.01 

0.13 

-0.5641 

-0.5641 

-0.5641 

-0.5641 

128 

0.79 

0.36 

0.02 

0.14 

-0.7018 

-0.7018 

-0.7018 

-0.7018 

256 

1.66 

0.65 

0.06 

0.22 

-0.6890 

-0.6890 

-0.6890 

-0.6890 

512 

5.61 

2.95 

0.26 

0.73 

-0.5971 

-0.5968 

-0.5970 

-0.5970 

1024 

32.69 

21.90 

1.32 

2.91 

-0.4944 

-0.4940 

-0.4944 

-0.4944 

2048 

156.7 

178.8 

8.96 

15.89 

-0.4514 

-0.4505 

-0.4512 

-0.4512 

4096 

695.1 

1551 

57.73 

91.54 

-0.4690 

-0.4678 

-0.4686 

-0.4686 


Table 4. CPU times in seconds and optima for the nonnegative quadratic program. 
Abbreviations: d stands for problem dimension, MM for the proximal distance algorithm, 
CV for CVX, MA for MATLAB’s quadprog, and YA for YALMIP. 


provides an approximate Lipschitz constant for f{x) on the region harboring the 
minimum point. This bound on p is usually too large. One remedy is to multiply 
the bound by a deflation factor such as 0.1. Another remedy is to replace the 
covariance A by the corresponding correlation matrix. Thus, one solves the prob¬ 
lem for the preconditioned matrix AD~^, where D is the diagonal matrix 
whose entries are the square roots of the corresponding diagonal entries of A. The 
transformed parameters y = Dx obey the same nonnegativity constraints as x. 

For testing purposes we filled a d x d matrix M with independent standard 
normal deviates and set A = M*M + I. Addition of the identity matrix avoids 
ill conditioning. We also filled the vector b with independent standard normal 
deviates. Our gentle tuning constant schedule e„ = max{1.005“", 10“^®} and 
Pn = min{1.005", 0.1 x L} adjusts p and e so slowly that their limits are not 
actually met in practice. In any event L is the a priori bound for the correlation 
matrix derived from A. Table 0] compares the performance of the MM proximal 
distance algorithm to MATLAB’s quadprog, CVX with the SDPT3 solver, and 
YALMIP with the MOSEK solver. MATLAB’s quadprog is clearly the fastest 
of the four tested methods on these problems. The relative speed of the MM 
algorithm improves as the problem dimension d increases. □ 
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m 

n 

df 

tpi 

tP2 

A 

Li 

L 1 /L 2 

Ti 

T 1 IT 2 

256 

128 

10 

5.97 

3.32 

0.143 

248.763 

0.868 

0.603 

8.098 

128 

256 

10 

3.83 

1.91 

0.214 

106.234 

0.744 

0.999 

10.254 

512 

256 

10 

6.51 

2.88 

0.119 

506.570 

0.900 

0.907 

6.262 

256 

512 

10 

4.50 

1.82 

0.172 

241.678 

0.835 

1.743 

8.687 

1024 

512 

10 

7.80 

5.25 

0.101 

1029.333 

0.921 

2.597 

5.057 

512 

1024 

10 

5.54 

2.58 

0.138 

507.451 

0.881 

8.235 

13.532 

2048 

1024 

10 

8.98 

8.49 

0.080 

2047.098 

0.945 

15.460 

8.858 

1024 

2048 

10 

6.80 

2.93 

0.110 

1044.640 

0.916 

34.997 

18.433 

4096 

2048 

10 

9.75 

9.90 

0.060 

4086.886 

0.966 

89.684 

10.956 

2048 

4096 

10 

8.36 

6.60 

0.086 

2045.645 

0.942 

166.386 

25.821 


Table 5. Numerical experiments comparing MM to MATLAB’s lasso. Each row presents 
averages over 100 independent simulations. Abbreviations: m the number of cases, n the 
number of predictors, df the number of actual predictors in the generating model, tp\ the 
number of true predictors selected by MM, tp 2 the number of true predictors selected by 
the lasso, A the regularization parameter at the lasso optimal loss, Li the optimal loss 
from MM, L 1 /L 2 the ratio of Li to the optimal lasso loss, Ti the total computation time 
in seconds for MM, and T\lT 2 the ratio of Ti to the total computation time of the lasso. 


Example 5. Linear Regression under an £q Constraint 

In this example the objective function is the sum of squares ^||y —where 
y is the response vector, X is the design matrix, and /3 is the vector of regression 
coefficients. The constraint set consists of those j3 with at most k nonzero 
entries. Projection onto the closed but nonconvex set is achieved by zeroing out 
all but the k largest coordinates in absolute value. These coordinates will be unique 
except in the rare circumstances of ties. The proximal distance algorithm for this 
problem coincides with that of the previous problem if we substitute X*X for A, 
—X*y for 6, /3 for x, and the projection operator Pgd for . Better accuracy 
can be maintained if the MM update exploits the singular value decomposition of 
X in forming the spectral decomposition of X*X. Although the proximal distance 
algorithm carries no absolute guarantee of finding the optimal set of k regression 
coefficients, it is far more efficient than sifting through all (^) sets of size k. The 
alternative of lasso-guided model selection must contend with strong shrinkage and 
a surplus of false positives. 

Table [5] compares the MM proximal distance algorithm to MATLAB’s lasso 
function. In simulating data, we filled X with standard normal deviates, set all 
components of /3 to 0 except for /3i = l/i for 1 < i < 10, and added a vector of 
standard normal deviates to Jt/J to determine y. For a given choice of m and n we 
ran each experiment 100 times and averaged the results. The table demonstrates 
the superior speed of the lasso and the superior accuracy of the MM algorithm as 
measured by optimal loss and model selection. □ 
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Example 6. Matrix Completion 

Let Y = {pij) denote a partially observed p x q matrix and A the set of index 
pairs (*,j) with observed. Matrix completion [8] imputes the missing entries 
by approximating Y with a low rank matrix X. Imputation relies on the singular 
value decomposition 

r 

X (8) 

i=l 

where r is the rank of X, the nonnegative singular values Gi are presented in 
decreasing order, the left singular vectors Ui are orthonormal, and the right singular 
vectors Vi are also orthonormal [20]. The set Rk oi px q matrices of rank k or less 
is closed. Projection onto Rk is accomplished by truncating the sum dSj) to 

min{r,fe} 

Pr^{X)= ^ aiUiv\. 

i=l 

When r > k and cr/c+i = Ufc, the projection operator is multi-valued. 

The MM principle allows one to restore the symmetry lost in the missing entries 
[34] . Suppose Xn is the current approximation to X. One simply replaces a 
missing entry of Y for {i,j) ^ A by the corresponding entry Xnij of Xn and 
adds the term ^{xnij — XijY to the least squares criterion 

/(^) = 2 ~ ■ 

Since the added terms majorize 0, they create a legitimate surrogate function. One 
can rephrase the surrogate by defining the orthogonal complement operator (1^) 
via the equation P^{Y) + Pa{Y) = Y. The matrix = Pa{Y) + P^{Xn) 
temporarily completes Y and yields the surrogate function ^11.^71 — In 

implementing a slightly modified version of the proximal distance algorithm, one 
must solve for the minimum of the Moreau function 

l\\Zn- X\\l + '^\\X - PR,{Xn)\\l. 

The stationarity condition 

0 = ^ ~ Zn + Wn[X — Pr^ (-X^n)] 


yields the trivial solution 

^n+l — 


1 


l+Wn 


l+Wn 


PRuiXn). 


Again this is guaranteed to decrease the objective function 

Ppi^) — n iVij ~ d" n diste(A', i?fe) 

(Lj)eA 
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p 

9 

a 

rank 

Li 

Li! L 2 

El 

ri/T2 

200 

250 

0.05 

20 

1598 

0.251 

4.66 

7 

800 

1000 

0.20 

80 

571949 

0.253 

131.02 

18.1 

1000 

1250 

0.25 

100 

1112604 

0.24 

222.2 

15.1 

1200 

1500 

0.15 

40 

793126 

0.361 

161.51 

3.6 

1200 

1500 

0.30 

120 

1569105 

0.235 

367.78 

12.3 

1400 

1750 

0.35 

140 

1642661 

0.236 

561.76 

9 

1800 

2250 

0.45 

180 

2955533 

0.171 

1176.22 

10.1 

2000 

2500 

0.10 

20 

822673 

0.50 

307.89 

1.9 

2000 

2500 

0.50 

200 

1087404 

0.192 

2342.32 

2 

5000 

5000 

0.05 

30 

7647707 

0.664 

1827.16 

2 


Table 6. Comparison of the MM proximal distance algorithm to Softimpute. Abbrevi¬ 
ations: p is the number of rows, q is the number of columns, a is the ratio of observed 
entries to total entries, Li is the optimal loss under MM, L 2 is the optimal loss under 
Softimpute, Ti is the total computation time (in seconds) for MM, and T 2 is the total 
computation time for Softimpute. 


for the choice Wn = p/ d\st^{XmRk)- 

In the spirit of Example SI let us derive a local Lipschitz constant based on the 
value /(O) = inequality 

\ H H (yy - ^ H -b xD 

(i,j)eA (i.j)eA (i.t)eA 

is equivalent to the inequality 

2 yijXij < x^j. 

(jj)eA (i,i)6A 


In view of the Cauchy-Schwarz inequality 




Vi 


E 




(i,j)eA y (z.i)eA 


no solution x of the constrained problem can satisfy 


E 4> 2 , E 


vl ■ 


V (ij)eA 

When the opposite inequality holds, 


(hj)eA 


l|V/(a;)||F= / yy {xi 3 -yijf< j E 


E 4^3 / ^ 


(id)eA 






(hi)eA 
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Again this tends to be a conservative estimate of the required local bound on p. 

Table [6] compares the performance of the MM proximal distance algorithm and 
a MATLAB implementation of Softimpute [53]. Although the proximal distance 
algorithm is noticeably slower, it substantially lowers the optimal loss and improves 
in relative speed as problem dimensions grow. □ 

Example 7. Sparse Inverse Covariance Estimation 

The graphical lasso has applications in estimating sparse inverse covariance 
matrices m- In this context, one minimizes the convex criterion 

-lndet0 + tr(S'0) + p||0||i, 

where 0 ~^isapxp theoretical covariance matrix, S' is a corresponding sample 
covariance matrix, and the graphical lasso penalty ||0||i equals the sum of the 
absolute values of the off-diagonal entries of 0. The solution exhibits both sparsity 
and shrinkage. One can avoid shrinkage by minimizing 

/(0) = -lndet0-f tr(S0) 

subject to 0 having at most 2k nonzero off-diagonal entries. Let be the closed 
set oipxp symmetric matrices possessing this property. Projection of a symmetric 
matrix M onto can be achieved by arranging the above-diagonal entries of M 
in decreasing absolute value and replacing all but the first k of these entries by 0. 
The below-diagonal entries are treated similarly. 

The proximal distance algorithm for minimizing /(0) subject to the set con¬ 
straints operates through the convex surrogate 

11 ) 

g {& I 0 „) = /( 0 ) + ^||0 - 

P 

Wn = , ^=- 

^||0„-PT,K®n)ll?. + e 

A stationary point minimizes the surrogate and satisfies 

0 = — 0 ^ -)- Wn® S — WuPt^ {®n)- (9) 

If the constant matrix S — WnPT^i®n) has spectral decomposition UnDnUn^ then 
multiplying equation @ on the left by U* and on the right by C/„ gives 

0 = -U*n@~^Un + WnU*n@Un + 

This suggests that we take E = Un®Un to be diagonal and require its diagonal 
entries Ci to satisfy 


0 — -\- WnCi -f dni- 
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Multiplying this identity by and solving for the positive root of the resulting 
quadratic yields 


_ —dni + \/ dni + 4:Wn 

2Wn 

Given the solution matrix En+i, we reconstruct 0„+i as UnEn+iU'^. 

Finding a local Lipschitz constant is more challenging in this example. Because 
the identity matrix is feasible, the minimum cannot exceed 

p 

— In det I + tr(S'/) = tr(S^) = uJt, 

1=1 

where S is assumed positive definite with eigenvalues uji ordered from largest to 
smallest. If the candidate matrix 0 is positive dehnite with ordered eigenvalues 
Ai, then the von Neumann-Fan inequality implies 

p p 

( 10 ) 

i—1 i—1 

To show that /(0) > f{I) whenever any Ai falls outside a designated interval, 
note that the contribution — In Xj + XjUip-j+i to the right side of inequality (fTUl) is 
bounded below by Incup-j+i +1 when Xj = Hence, /(0) > f{I) whenever 

p 

— In Ai + XiOjp-ij-i > 'y ( oJi — y~!(ln^p-j+i + !)■ (11) 

i=l j/i 

Given the strict convexity of the function — InAi + A^Wp-i+i, equality holds in 
inequality dill) at exactly two points Aimm > 0 and Aimax > Aimin- These roots 
can be readily extracted by bisection or Newton’s method. The strict inequality 
/(0) > /(/) holds when any Xi falls to the left of Aimin or to the right of Aimax- 
Within the intersection of the intervals [Aimax, Aimin], the gradient of/(0) satisfies 


||V/(0)||;^<||0-i||;^ + ||S| 


F < 


\ 


i;Ay + iisiiF< 


\ 




This bound serves as a local Lipschitz constant near the optimal point. 

Table [7] compares the performance of the MM algorithm to that of the R glasso 
package [19]. The sample precision matrix S~^ = LL* + 6MM* was generated 
by filling the diagonal and first three subdiagonals of the banded lower triangular 
matrix L with standard normal deviates. Filling M with standard normal deviates 
and choosing 5 = 0.01 imposed a small amount of noise obscuring the band nature 
of LL*. All table statistics represent averages over 10 runs started at 0 = 5“^ 
with k equal to the true number of nonzero entries in LL*. The MM algorithm 
performs better in minimizing average loss and recovering nonzero entries. □ 
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p 

h 

ki 

h 

P 

Li 

L 2 — Li 

Ti 

T 1 /T 2 

8 

18 

14.0 

14.0 

0.00186 

- 12.35 

0.01 

0.022 

43.458 

16 

42 

30.5 

28.7 

0.00305 

-25.17 

0.08 

0.026 

43.732 

32 

90 

53.5 

49.9 

0.00330 

-50.75 

0.17 

0.054 

31.639 

64 

186 

97.8 

89.3 

0.00445 

-98.72 

0.53 

0.234 

28.542 

128 

378 

191.6 

169.9 

0.00507 

-196.09 

1.14 

1.060 

18.693 

256 

762 

345.0 

304.2 

0.00662 

-369.62 

2.55 

4.253 

9.559 

512 

1530 

636.4 

566.8 

0.00983 

-641.89 

6.72 

19.324 

5.679 


Table 7. Numerical results for precision matrix estimation. Abbreviations: p for matrix 
dimension, kt for the number of nonzero entries in the true model, fci for the number of 
true nonzero entries recovered by the MM algorithm, k 2 for the number of true nonzero 
entries recovered by glasso, p the average tuning constant for glasso for a given kt, 
Li the average loss from the MM algorithm, Li — L 2 the difference between Li and the 
average loss from glasso, Ti the average compute time in seconds for the MM algorithm, 
and T 1 /T 2 the ratio of Ti to the average compute time for glasso. 


6. Discussion 

The MM principle offers a unique and potent perspective on high-dimensional op¬ 
timization. The current survey emphasizes proximal distance algorithms and their 
applications in nonlinear programming. Our construction of this new class of al¬ 
gorithms relies on the exact penalty method of Clarke [10] and majorization of 
a smooth approximation to the Euclidean distance to the constraint set. Well- 
studied proximal maps and Euclidean projections constitute the key ingredients of 
seven realistic examples. These examples illustrate the versatility of the method in 
handling nonconvex constraints, its improvement as problem dimension increases, 
and the pitfalls in sending the tuning constants p and e too quickly to their lim¬ 
its. Certainly, the proximal distance algorithm is not a panacea for optimization 
problems. For example, the proximal distance algorithm as formulated exhibits 
remarkably fickle behavior on linear programming problems. For linear program¬ 
ming, we ensure numerical stability and guard against premature convergence only 
by great care in parameter tuning and updating. Nonetheless, we are sufficiently 
encouraged to pursue this research further, particularly in statistical applications 
where model fitting and selection are compromised by aggressive penalization. 
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